Daily weather observations from multiple locations around Australia, obtained from the Australian Commonwealth Bureau of Meteorology and processed to create this relatively large sample dataset for illustrating analytics, data mining, and data science using R and Rattle. The data has been processed to provide a target variable RainTomorrow (whether there is rain on the following day: No/Yes) and a risk variable RISK_MM (how much rain was recorded, in millimeters). Various transformations are performed on the data.

The weatherAUS dataset is regularly updated, and updates of this package usually correspond to updates of this dataset. The data is updated from the Bureau of Meteorology web site. The locationsAUS dataset records the location of each weather station.

The source dataset comes from the Australian Commonwealth Bureau of Meteorology. The Bureau provided permission to use the data with the Bureau of Meteorology acknowledged as the source of the data, as per email from Cathy Toby (C.Toby@bom.gov.au) of the Climate Information Services of the National Climate Centre, 17 Dec 2008.
The weatherAUS dataset is a data frame containing over 140,000 daily observations from over 45 Australian weather stations. Variables include:
Date : The date of observation (a Date object).
Location : The common name of the location of the weather station.
MinTemp : The minimum temperature in degrees Celsius.
MaxTemp : The maximum temperature in degrees Celsius.
Rainfall : The amount of rainfall recorded for the day in mm.
Evaporation: The so-called Class A pan evaporation (mm) in the 24 hours to 9am.
Sunshine : The number of hours of bright sunshine in the day.
WindGustDir : The direction of the strongest wind gust in the 24 hours to midnight.
WindGustSpeed : The speed (km/h) of the strongest wind gust in the 24 hours to midnight.
Temp9am : Temperature (degrees C) at 9am.
RelHumid9am : Relative humidity (percent) at 9am.
Cloud9am : Fraction of sky obscured by cloud at 9am. This is measured in “oktas”, which are units of eighths. It records how many eighths of the sky are obscured by cloud. A 0 measure indicates a completely clear sky, whilst an 8 indicates that it is completely overcast.
WindSpeed9am : Wind speed (km/hr) averaged over 10 minutes prior to 9am.
Pressure9am : Atmospheric pressure (hpa) reduced to mean sea level at 9am.
Temp3pm : Temperature (degrees C) at 3pm.
RelHumid3pm : Relative humidity (percent) at 3pm.
Cloud3pm : Fraction of sky obscured by cloud (in “oktas”: eighths) at 3pm. See Cloud9am for a description of the values.
WindSpeed3pm : Wind speed (km/hr) averaged over 10 minutes prior to 3pm.
Pressure3pm : Atmospheric pressure (hpa) reduced to mean sea level at 3pm.
ChangeTemp : Change in temperature.
ChangeTempDir : Direction of change in temperature.
ChangeTempMag : Magnitude of change in temperature.
ChangeWindDirect : Direction of wind change.
MaxWindPeriod : Period of maximum wind.
RainToday : Integer: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0.
TempRange : Difference between minimum and maximum temperatures (degrees C) in the 24 hours to 9am.
PressureChange : Change in pressure.
RISK_MM : The amount of rain. A kind of measure of the “risk”.
RainTomorrow : The target variable. Did it rain tomorrow?
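A minimal sketch of loading the dataset in R, assuming the rattle package is installed (the package ships weatherAUS; the CSV at rattle.togaware.com is an alternative):

```r
# Load weatherAUS from the rattle package
library(rattle)      # provides data(weatherAUS)
data("weatherAUS")
dim(weatherAUS)      # over 140,000 rows, one per station-day
class(weatherAUS$Date)  # "Date"

# Alternatively, without rattle:
# weatherAUS <- read.csv("https://rattle.togaware.com/weatherAUS.csv")
```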
Author(s): Graham.Williams@togaware.com
Source: Observations were drawn from numerous weather stations. The daily observations are available from https://www.bom.gov.au/climate/data. Copyright Commonwealth of Australia 2010, Bureau of Meteorology.
Definitions adapted from https://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml
References: Package home page: https://rattle.togaware.com. Data source: https://www.bom.gov.au/climate/dwo/ and https://www.bom.gov.au/climate/data.
Evaluating fairness
This data comes from the Australian Commonwealth Bureau of Meteorology, which I believe is responsible for providing statistically sound predictions and correct meteorological updates to the Australian people. In our case, predicting 3pm temperatures could help people stay safe or plan trips. For businesses, it could help them plan accordingly; in particular, it might help ice cream shops, tourism operators, ridesharing companies, and others, as they can plan their business around temperature. So it clearly benefits the Australian people.
Since this data was collected by recording temperatures, wind speed, pressure change, and so on, I don't think it involved any animal or human testing that would raise ethical concerns, so most probably nobody was harmed in collecting it. For our case, predicting 3pm temperature, I don't think anyone is harmed either. Therefore, this data seems ethical and safe to use.
Project Purpose:
Assuming the role of a meteorologist in Australia, I wanted to improve the accuracy of afternoon temperature forecasts for six major cities: Hobart, Adelaide, Canberra, Brisbane, Melbourne, and Sydney. The key challenge was to predict the 3 p.m. temperature with enough precision to create value across many layers of society (individuals, businesses, governments, and international stakeholders), influencing health, safety, commerce, and infrastructure planning, and ultimately benefiting everyone from an Australian family planning an afternoon picnic to a global airline plotting a trans-Pacific route.
Project Description
I built a machine learning pipeline in R to predict 3pm temperatures; my goal was to demonstrate a complete end-to-end forecasting workflow. In brief, the project steps were:
Data Prep: Filtered six cities, converted temps to °F, selected key predictors (Temp9am, Location, Pressure9am, WindSpeed9am, Humidity9am), and explored patterns with ggplot2.
Models: Model 1: Temp3pm ~ Temp9am + Location + Pressure9am → R² ≈ 0.76, MAE ≈ 4 °F. Model 2: added WindSpeed9am + Humidity9am → R² ≈ 0.77, lower MAE. Used R's tidymodels framework for regression analysis.
Validation: Used 10-fold cross-validation to check for overfitting, since it provides robust, low-variance accuracy estimates and is preferred over a single train/test split. Also performed fold-by-fold validation.
Findings:
Temp9am was the strongest predictor; controlling for location and pressure at 9am, we expect 3pm temperatures to increase by roughly 0.95 degrees F for every 1 degree F increase in 9am temperature.
Adelaide was chosen as the reference level, so every other location is interpreted relative to Adelaide. For example, for Hobart: when controlling for temperature at 9am, pressure at 9am, and the other locations, we expect the 3pm temperature in Hobart to be 1.29 degrees F lower than in Adelaide on average.
Filtering the rows for six locations (Hobart, Adelaide, Canberra, Brisbane, Melbourne, and Sydney), mutating (in this case overwriting) the Temp9am and Temp3pm values to Fahrenheit, and then selecting the six relevant columns.
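The corresponding setup and wrangling code, as given in the project source:

```r
library(tidyverse)
library(tidymodels)
library(rattle)
data("weatherAUS")

weather_data <- weatherAUS %>%
  # keep only the six cities of interest
  filter(Location %in% c("Hobart", "Adelaide", "Canberra",
                         "Brisbane", "Melbourne", "Sydney")) %>%
  # convert Celsius to Fahrenheit (F = C * 1.8 + 32)
  mutate(Temp9am = Temp9am * 1.8 + 32,
         Temp3pm = Temp3pm * 1.8 + 32) %>%
  # keep the response and the five candidate predictors
  select(Temp3pm, Location, WindSpeed9am, Humidity9am, Pressure9am, Temp9am)
```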
```r
# Checking it out
dim(weather_data)
## [1] 26828     6
head(weather_data)
## # A tibble: 6 × 6
##   Temp3pm Location WindSpeed9am Humidity9am Pressure9am Temp9am
##     <dbl> <chr>           <dbl>       <int>       <dbl>   <dbl>
## 1    69.6 Sydney             17          92       1018.    69.3
## 2    76.6 Sydney              9          83       1018.    72.3
## 3    73.4 Sydney             17          88       1017.    74.3
## 4    69.6 Sydney             22          83       1014.    70.5
## 5    77.9 Sydney             11          88       1008.    72.5
## 6    78.8 Sydney              9          69       1003.    74.8
```
```r
# Arranging in descending order, then looking at the top 3 using head() with 3 as an argument
top_3_hottest <- weather_data %>%
  arrange(desc(Temp3pm)) %>%
  head(3)
```
```r
# group_by() makes mini groups for every location; summarise() then takes the mean Temp3pm of each group
temp3pm_6locs <- weather_data %>%
  group_by(Location) %>%
  summarise(mean(Temp3pm, na.rm = TRUE))
```
```r
# Let's check it out
temp3pm_6locs
## # A tibble: 6 × 2
##   Location  `mean(Temp3pm, na.rm = TRUE)`
##   <chr>                             <dbl>
## 1 Adelaide                           70.8
## 2 Brisbane                           76.7
## 3 Canberra                           67.0
## 4 Hobart                             61.2
## 5 Melbourne                          66.6
## 6 Sydney                             70.9
```
```r
# Density plot of Temp3pm for the 6 locations
ggplot(weather_data, aes(x = Temp3pm, fill = Location)) +
  geom_density(alpha = 0.6)  # alpha adds translucency to see overlap
```
```r
# Creating a subplot for each city
# facet_wrap() splits one plot into smaller subplots based on the values in a column
ggplot(weather_data, aes(x = Temp9am, y = Temp3pm, color = Humidity9am)) +
  geom_point() +
  facet_wrap(~Location) +
  scale_color_gradient(low = "#132B43", high = "#56B1F7")
```
Interpreting both the plots.
Interpretation, Part-d (a): We're comparing the 3pm temperature of the 6 cities and its variability. Brisbane's afternoons seem a little hotter in general relative to the other cities, as its density is skewed towards the right (higher temperatures). Similarly, Canberra's afternoons seem pretty cool relative to the other cities, as its density is skewed towards the left. Apart from these extremes, Hobart, Adelaide, and Melbourne overlap with similar temperatures skewed towards the cold end, and Sydney seems to have the most mid-range temperatures of them all.
Interpretation, Part-d (b): In this faceted scatter plot, we're comparing 9am humidity across the cities. Adelaide shows 9am humidity decreasing roughly linearly as the 9am and 3pm temperatures get hotter; so if a morning in Adelaide starts warmer and stays warmer into the afternoon, the 9am humidity will most likely be lower. Canberra, Hobart, Melbourne, and Sydney show more stable, higher humidity at 9am, and in all of them humidity seems to drop a bit at the warmest temperatures.
Model_1 Building
```r
# Specifying the model
lm_spec <- linear_reg() |>
  set_mode("regression") |>
  set_engine("lm")

# Checking it out
lm_spec
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```
```r
# Fitting the model
weather_model_1 <- lm_spec |>
  fit(Temp3pm ~ Temp9am + Location + Pressure9am, data = weather_data)

# Obtaining the predictions + residuals with augment(), as it gives both for every row
weather_model_1 %>%
  augment(new_data = weather_data) %>%
  head(10)
## # A tibble: 10 × 8
##    .pred .resid Temp3pm Location WindSpeed9am Humidity9am Pressure9am Temp9am
##    <dbl>  <dbl>   <dbl> <chr>           <dbl>       <int>       <dbl>   <dbl>
##  1  75.9 -6.29     69.6 Sydney             17          92       1018.    69.3
##  2  78.9 -2.22     76.6 Sydney              9          83       1018.    72.3
##  3  80.6 -7.16     73.4 Sydney             17          88       1017.    74.3
##  4  76.6 -6.96     69.6 Sydney             22          83       1014.    70.5
##  5  77.5  0.351    77.9 Sydney             11          88       1008.    72.5
##  6  78.9 -0.108    78.8 Sydney              9          69       1003.    74.8
##  7  74.7 -2.60     72.1 Sydney             15          75        999     71.1
##  8  71.4 -1.41     70.0 Sydney              7          77       1008.    66.0
##  9  68.0 -6.32     61.7 Sydney             19          92       1006.    62.8
## 10  69.4  4.58     73.9 Sydney             11          80       1014     63.0
```
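The coefficient estimates interpreted later (0.95 for Temp9am, -1.29 for LocationHobart) come from tidy(), as in the project source:

```r
# Checking the fitted coefficients of weather_model_1
weather_model_1 |> tidy()
```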
```r
# Creating the residual plot for weather_model_1
weather_model_1 %>%
  augment(new_data = weather_data) %>%
  ggplot(aes(x = .pred, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)  # because we want to see whether residuals fall below or above 0
```
The residuals look random: I don't see any particular shape or pattern such as a slope or parabola. This suggests the model satisfies the assumptions of linear regression. The points are scattered more heavily in the middle than in the tails, and there seem to be some outliers, but not extreme enough to invalidate the model.
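The R² interpreted below comes from the model-level metrics given by glance(), as in the project source:

```r
# Getting the model-level metrics (R^2, etc.) to evaluate the model's strength
weather_model_1 %>% glance()
```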
Interpretation of the R²: weather_model_1 seems strong, as indicated by a high R² of 0.757. The predictor variables (temperature at 9am, location, and pressure at 9am) explain approximately 76% of the variability in our response variable, temperature at 3pm. It is not overwhelming, as roughly 24% of the variability is still unexplained, but in general it seems strong.
```r
# Checking the accuracy: how far off my predictions are
weather_model_1 %>%
  augment(new_data = weather_data) %>%
  summarise(mae = mean(abs(.resid), na.rm = TRUE))  # had to remove missing values, as they gave mae as NA
## # A tibble: 1 × 1
##     mae
##   <dbl>
## 1  4.24
```
Interpretation of MAE: The mean absolute error is roughly 4, which means the predictions are off by about 4 degrees F on average for 3pm temperatures. Relative to the scale of the data, where 3pm temperatures range from roughly 40 to 120 degrees F, being 4 degrees F off across an 80-degree range (120 - 40) is 5% (4/80 * 100). So the typical error is about 5% of the observed range, which is quite accurate for this purpose.
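The relative-error arithmetic can be checked directly (the 40 to 120 °F range is the approximation used above):

```r
mae <- 4                 # mean absolute error, degrees F
range_f <- 120 - 40      # approximate span of observed 3pm temperatures
relative_error <- mae / range_f * 100
relative_error           # 5 (percent)
```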
Interpretation of the Temp9am and LocationHobart coefficients:

Temp9am coefficient interpretation: When controlling for location and pressure at 9am, we expect 3pm temperatures to increase by roughly 0.95 degrees F for every 1 degree F increase in 9am temperature.

LocationHobart interpretation: When controlling for temperature at 9am, pressure at 9am, and the other locations, we expect the 3pm temperature in Hobart to be 1.29 degrees F lower than in Adelaide on average. (Adelaide is our reference level.)
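For completeness, Model 2 adds the two extra predictors; the fitting code, as given in the project source:

```r
# Fitting weather_model_2 with all five predictor variables
weather_model_2 <- lm_spec %>%
  fit(Temp3pm ~ Temp9am + Location + Pressure9am + WindSpeed9am + Humidity9am,
      data = weather_data)
weather_model_2 %>% tidy()
```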
Interpretation of Model_1 and Model_2's strength: As per the in-sample metrics, Model_2 seems better than Model_1, although we will check thoroughly with k-fold cross-validation before accepting it as better.
Applying k-fold cross-validation to both models to check for over-fitting
```r
# Conducting the 10-fold cross-validation on model_1
set.seed(253)
weather_model_1_cv <- lm_spec %>%
  fit_resamples(
    Temp3pm ~ Temp9am + Location + Pressure9am,
    resamples = vfold_cv(weather_data, v = 10),
    metrics = metric_set(mae, rsq)
  )
weather_model_1_cv %>% collect_metrics()
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config        
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
## 1 mae     standard   4.25     10 0.0216  pre0_mod0_post0
## 2 rsq     standard   0.757    10 0.00165 pre0_mod0_post0
```
```r
# Setting the seed for reproducibility, as the k-fold CV algorithm splits randomly every run
set.seed(253)
weather_model_2_cv <- lm_spec %>%
  fit_resamples(
    Temp3pm ~ Temp9am + Location + Pressure9am + Humidity9am + WindSpeed9am,
    resamples = vfold_cv(weather_data, v = 10),
    metrics = metric_set(mae, rsq)
  )
weather_model_2_cv %>% collect_metrics()
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config        
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
## 1 mae     standard   4.12     10 0.0242  pre0_mod0_post0
## 2 rsq     standard   0.769    10 0.00149 pre0_mod0_post0
```
Let's check fold by fold to see if there's any over-fitting, just to be sure before comparing: any unusual variance across folds would reveal overfitting.
```r
# Model 1, fold-by-fold CV
weather_model_1_cv %>%
  unnest(.metrics) %>%
  filter(.metric == "mae")
## # A tibble: 10 × 7
##    splits               id     .metric .estimator .estimate .config     .notes  
##    <list>               <chr>  <chr>   <chr>          <dbl> <chr>       <list>  
##  1 <split [24145/2683]> Fold01 mae     standard        4.36 pre0_mod0_… <tibble>
##  2 <split [24145/2683]> Fold02 mae     standard        4.24 pre0_mod0_… <tibble>
##  3 <split [24145/2683]> Fold03 mae     standard        4.23 pre0_mod0_… <tibble>
##  4 <split [24145/2683]> Fold04 mae     standard        4.22 pre0_mod0_… <tibble>
##  5 <split [24145/2683]> Fold05 mae     standard        4.14 pre0_mod0_… <tibble>
##  6 <split [24145/2683]> Fold06 mae     standard        4.31 pre0_mod0_… <tibble>
##  7 <split [24145/2683]> Fold07 mae     standard        4.25 pre0_mod0_… <tibble>
##  8 <split [24145/2683]> Fold08 mae     standard        4.17 pre0_mod0_… <tibble>
##  9 <split [24146/2682]> Fold09 mae     standard        4.21 pre0_mod0_… <tibble>
## 10 <split [24146/2682]> Fold10 mae     standard        4.32 pre0_mod0_… <tibble>
```
```r
# Model 2, fold-by-fold CV
weather_model_2_cv %>%
  unnest(.metrics) %>%
  filter(.metric == "mae")
## # A tibble: 10 × 7
##    splits               id     .metric .estimator .estimate .config     .notes  
##    <list>               <chr>  <chr>   <chr>          <dbl> <chr>       <list>  
##  1 <split [24145/2683]> Fold01 mae     standard        4.26 pre0_mod0_… <tibble>
##  2 <split [24145/2683]> Fold02 mae     standard        4.13 pre0_mod0_… <tibble>
##  3 <split [24145/2683]> Fold03 mae     standard        4.09 pre0_mod0_… <tibble>
##  4 <split [24145/2683]> Fold04 mae     standard        4.11 pre0_mod0_… <tibble>
##  5 <split [24145/2683]> Fold05 mae     standard        3.99 pre0_mod0_… <tibble>
##  6 <split [24145/2683]> Fold06 mae     standard        4.21 pre0_mod0_… <tibble>
##  7 <split [24145/2683]> Fold07 mae     standard        4.10 pre0_mod0_… <tibble>
##  8 <split [24145/2683]> Fold08 mae     standard        4.07 pre0_mod0_… <tibble>
##  9 <split [24146/2682]> Fold09 mae     standard        4.10 pre0_mod0_… <tibble>
## 10 <split [24146/2682]> Fold10 mae     standard        4.16 pre0_mod0_… <tibble>
```
Conclusion
The fold-by-fold MAEs of both weather_model_1 and weather_model_2 show no overfitting: there are no extreme fluctuations across folds, which suggests both models generalize well to unseen data. Comparing the CV metrics, weather_model_2 is better at predicting 3pm temperatures, with a lower MAE of 4.12 versus weather_model_1's 4.24, and a higher R² of 0.769 versus weather_model_1's 0.757.
Title: Forecasting 3 PM Temperatures Across Six Australian Cities. Subtitle: Statistical Machine Learning Project. Author: Mohammed Sohail Khan. Date: 21 Sep 2025.